Robotics 20
☆ VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes
VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework
designed for large scenes. The framework comprises four main components: VIO
Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO
Front End, RGB frames are processed through dense bundle adjustment and
uncertainty estimation to extract scene geometry and poses. Based on this
output, the mapping module incrementally constructs and maintains a 2D Gaussian
map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer,
Score Manager, and Pose Refinement, which collectively improve mapping speed
and localization accuracy. This enables the SLAM system to handle large-scale
urban environments with up to 50 million Gaussian ellipsoids. To ensure global
consistency in large-scale scenes, we design a Loop Closure module, which
innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian
Splatting for loop closure detection and correction of the Gaussian map.
Additionally, we propose a Dynamic Eraser to address the inevitable presence of
dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor
and outdoor environments demonstrate that our approach achieves localization
performance on par with Visual-Inertial Odometry while surpassing recent
GS/NeRF SLAM methods. It also significantly outperforms all existing methods in
terms of mapping and rendering quality. Furthermore, we developed a mobile app
and verified that our framework can generate high-quality Gaussian maps in real
time using only a smartphone camera and a low-frequency IMU sensor. To the best
of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method
capable of operating in outdoor environments and supporting kilometer-scale
large scenes.
☆ FDPP: Fine-tune Diffusion Policy with Human Preference
Imitation learning from human demonstrations enables robots to perform
complex manipulation tasks and has recently achieved notable success. However,
these techniques often struggle to adapt behavior to new preferences or changes
in the environment. To address these limitations, we propose Fine-tuning
Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function
through preference-based learning. This reward is then used to fine-tune the
pre-trained policy with reinforcement learning (RL), resulting in alignment of
pre-trained policy with new human preferences while still solving the original
task. Our experiments across various robotic tasks and preferences demonstrate
that FDPP effectively customizes policy behavior without compromising
performance. Additionally, we show that incorporating Kullback-Leibler (KL)
regularization during fine-tuning prevents over-fitting and helps maintain the
competencies of the initial policy.
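The KL regularization described above can be sketched as a per-sample penalty added to the preference reward. This is a minimal illustrative sketch, not the paper's exact formulation: the function names and the single-sample KL estimate are assumptions.

```python
import math

def kl_regularized_reward(pref_reward, logp_current, logp_pretrained, beta=0.1):
    """Preference reward with a per-sample KL penalty toward the pre-trained policy.

    The term log pi(a|s) - log pi_0(a|s) is a single-sample estimate of
    KL(pi || pi_0); subtracting it discourages the fine-tuned policy from
    drifting far from the initial policy's competencies.
    """
    return pref_reward - beta * (logp_current - logp_pretrained)
```

With `beta = 0`, fine-tuning optimizes the preference reward alone; larger `beta` trades preference alignment for staying close to the pre-trained behavior.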
☆ Data-driven Spatial Classification using Multi-Arm Bandits for Monitoring with Energy-Constrained Mobile Robots
We consider the spatial classification problem for monitoring using data
collected by a coordinated team of mobile robots. Such classification problems
arise in several applications including search-and-rescue and precision
agriculture. Specifically, we want to classify the regions of a search
environment into interesting and uninteresting as quickly as possible using a
team of mobile sensors and mobile charging stations. We develop a data-driven
strategy that accommodates the noise in sensed data and the limited energy
capacity of the sensors, and generates collision-free motion plans for the
team. We propose a bi-level approach, where a high-level planner leverages a
multi-armed bandit framework to determine the potential regions of interest for
the drones to visit next based on the data collected online. Then, a low-level
path planner based on integer programming coordinates the paths for the team to
visit the target regions subject to the physical constraints. We characterize
several theoretical properties of the proposed approach, including anytime
guarantees and task completion time. We show the efficacy of our approach in
simulation, and further validate these observations in physical experiments
using mobile robots.
comment: 8 pages, 6 figures. See https://www.youtube.com/watch?v=gzulpOcVYzg
for an overview of the approach along with videos of the hardware experiments
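The high-level planner's bandit-based region selection can be illustrated with a standard Upper Confidence Bound (UCB) rule. This is a generic sketch of the idea, not the paper's algorithm; the function name, scoring constant, and "interestingness" statistics are assumptions.

```python
import math

def ucb_select(region_means, region_counts, t, c=2.0):
    """Pick the next region to survey with an Upper Confidence Bound rule.

    region_means[i]:  running mean of 'interestingness' observations in region i
    region_counts[i]: number of times region i has been sensed so far
    t:                total number of sensing rounds so far
    Unvisited regions score infinity, so every region is tried at least once.
    """
    def score(i):
        if region_counts[i] == 0:
            return float("inf")
        return region_means[i] + math.sqrt(c * math.log(t) / region_counts[i])
    return max(range(len(region_means)), key=score)
```

The exploration bonus shrinks as a region accumulates observations, so the planner gradually concentrates the team on regions whose data suggest they are interesting.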
☆ Hybrid Action Based Reinforcement Learning for Multi-Objective Compatible Autonomous Driving
Reinforcement Learning (RL) has shown excellent performance in solving
decision-making and control problems of autonomous driving, which is
increasingly applied in diverse driving scenarios. However, driving is a
multi-attribute problem, leading to challenges in achieving multi-objective
compatibility for current RL methods, especially in both policy execution and
policy iteration. On the one hand, a common action-space structure with a
single action type limits driving flexibility or results in large behavior
fluctuations during policy execution. On the other hand, a single reward
function weighted across multiple attributes results in the agent paying
disproportionate attention to certain objectives during policy iteration. To
this end, we
propose a Multi-objective Ensemble-Critic reinforcement learning method with
Hybrid Parametrized Action for multi-objective compatible autonomous driving.
Specifically, a parameterized action space is constructed to generate hybrid
driving actions, combining both abstract guidance and concrete control
commands. A multi-objective critic architecture is constructed over multiple
attribute rewards to keep the agent focused on different driving objectives
simultaneously. Additionally, an uncertainty-based exploration strategy is
introduced to help the agent approach a viable driving policy faster. The
experimental results in both the simulated traffic environment and the HighD
dataset demonstrate that our method can achieve multi-objective compatible
autonomous driving in terms of driving efficiency, action consistency, and
safety. It enhances overall driving performance while significantly improving
training efficiency.
comment: 12 pages, 9 figures, 5 tables
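A parameterized hybrid action, as described above, pairs one discrete guidance choice with continuous control commands. The sketch below is a hypothetical illustration of that structure; the field names, guidance set, and control bounds are assumptions, not the paper's definitions.

```python
from dataclasses import dataclass

# Hypothetical discrete guidance options; the paper's exact action set may differ.
GUIDANCE = ("keep_lane", "change_left", "change_right")

@dataclass
class HybridAction:
    """One discrete abstract-guidance choice plus continuous control commands."""
    guidance: str   # abstract guidance, e.g. a lane-keeping or lane-change decision
    accel: float    # concrete longitudinal acceleration command [m/s^2]
    steer: float    # concrete steering command [rad]

def clip_action(a: HybridAction, a_max=3.0, s_max=0.5) -> HybridAction:
    """Project raw network outputs into the feasible control range."""
    clamp = lambda x, lo, hi: max(lo, min(hi, x))
    return HybridAction(a.guidance,
                        clamp(a.accel, -a_max, a_max),
                        clamp(a.steer, -s_max, s_max))
```

Keeping the discrete choice and its continuous parameters in one action object lets a policy commit to a maneuver while still tuning how aggressively it is executed.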
☆ HydroelasticTouch: Simulation of Tactile Sensors with Hydroelastic Contact Surfaces
Thanks to recent advancements in the development of inexpensive,
high-resolution tactile sensors, touch sensing has become popular in
contact-rich robotic manipulation tasks. With the surge of data-driven methods
and their requirement for substantial datasets, several methods of simulating
tactile sensors have emerged in the tactile research community to overcome
real-world data collection limitations. These simulation approaches can be
split into two main categories: fast but inaccurate (soft) point-contact models
and slow but accurate finite element modeling. In this work, we present a novel
approach to simulating pressure-based tactile sensors using the hydroelastic
contact model, which provides a high degree of physical realism at a reasonable
computational cost. This model produces smooth contact forces for soft-to-soft
and soft-to-rigid contacts along even non-convex contact surfaces. Pressure
values are approximated at each point of the contact surface and can be
integrated to calculate sensor outputs. We validate our models' capacity to
synthesize real-world tactile data by conducting zero-shot sim-to-real transfer
of a model for object state estimation. Our simulation is available as a
plug-in to our open-source, MuJoCo-based simulator.
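The sensor-output computation described above (pressure values approximated at contact-surface points, then integrated) reduces to a discrete surface integral. This is a minimal sketch under that reading; the function name and element-wise discretization are assumptions.

```python
def integrate_pressure(pressures, areas):
    """Approximate a tactile reading by integrating pressure over a
    discretized contact surface: normal force ~ sum_i p_i * A_i.

    pressures: per-element pressure samples [Pa]
    areas:     per-element surface areas [m^2]
    """
    if len(pressures) != len(areas):
        raise ValueError("one area per pressure sample required")
    return sum(p * a for p, a in zip(pressures, areas))
```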
☆ CHEQ-ing the Box: Safe Variable Impedance Learning for Robotic Polishing
Robotic systems are increasingly employed for industrial automation, with
contact-rich tasks like polishing requiring dexterity and compliant behaviour.
These tasks are difficult to model, making classical control challenging. Deep
reinforcement learning (RL) offers a promising solution by enabling the
learning of models and control policies directly from data. However, its
application to real-world problems is limited by data inefficiency and unsafe
exploration. Adaptive hybrid RL methods blend classical control and RL
adaptively, combining the strengths of both: structure from control and
learning from RL. This has led to improvements in data efficiency and
exploration safety. However, their potential for hardware applications remains
underexplored, with no evaluations on physical systems to date. Such
evaluations are critical to fully assess the practicality and effectiveness of
these methods in real-world settings. This work presents an experimental
demonstration of the hybrid RL algorithm CHEQ for robotic polishing with
variable impedance, a task requiring precise force and velocity tracking. In
simulation, we show that variable impedance enhances polishing performance. We
compare standalone RL with adaptive hybrid RL, demonstrating that CHEQ achieves
effective learning while adhering to safety constraints. On hardware, CHEQ
achieves effective polishing behaviour, requiring only eight hours of training
and incurring just five failures. These results highlight the potential of
adaptive hybrid RL for real-world, contact-rich tasks trained directly on
hardware.
☆ AI Guide Dog: Egocentric Path Prediction on Smartphone
Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Priyam Kumar, Aditi Sharma, Ben Sukboontip, Jayant Sravan Tamarapalli, Jingyi Zhang, Anirudh Koul
This paper introduces AI Guide Dog (AIGD), a lightweight egocentric
navigation assistance system for visually impaired individuals, designed for
real-time deployment on smartphones. AIGD addresses key challenges in blind
navigation by employing a vision-only, multi-label classification approach to
predict directional commands, ensuring safe traversal across diverse
environments. We propose a novel technique to enable goal-based outdoor
navigation by integrating GPS signals and high-level directions, while also
addressing uncertain multi-path predictions for destination-free indoor
navigation. Our generalized model is the first navigation assistance system to
handle both goal-oriented and exploratory navigation scenarios across indoor
and outdoor settings, establishing a new state-of-the-art in blind navigation.
We present methods, datasets, evaluations, and deployment insights to encourage
further innovations in assistive navigation systems.
☆ Low-Contact Grasping of Soft Tissue with Complex Geometry using a Vortex Gripper
Soft tissue manipulation is an integral aspect of most surgical procedures;
however, the vast majority of surgical graspers used today are made of hard
materials, such as metals or hard plastics. Furthermore, these graspers
predominately function by pinching tissue between two hard objects as a method
for tissue manipulation. As such, the potential to apply too much force during
contact, and thus damage tissue, is inherently high. As an alternative
approach, graspers developed using a pneumatic vortex could potentially levitate
soft tissue, enabling manipulation with low or even no contact force. In this
paper, we present the design as well as a full factorial study of the force
characteristics of the vortex gripper grasping soft surfaces with four common
shapes, with convex and concave curvature, and ranging over 10 different radii
of curvature, for a total of 40 unique surfaces. By changing the parameters of
the nozzle elements in the design of the gripper, it was possible to
investigate the influence of the mass flow parameters of the vortex gripper on
the lifting force for all of these different soft surfaces. An ex vivo
experiment was conducted on grasping biological tissues and soft
balls of various shapes to show the advantages and disadvantages of the
proposed technology. The obtained results allowed us to identify limitations of
vortex technology and to outline the next stages of its development for medical
use.
comment: Submitted to T-MRB
♻ ☆ FaVoR: Features via Voxel Rendering for Camera Relocalization WACV
Camera relocalization methods range from dense image alignment to direct
camera pose regression from a query image. Among these, sparse feature matching
stands out as an efficient, versatile, and generally lightweight approach with
numerous applications. However, feature-based methods often struggle with
significant viewpoint and appearance changes, leading to matching failures and
inaccurate pose estimates. To overcome this limitation, we propose a novel
approach that leverages a globally sparse yet locally dense 3D representation
of 2D features. By tracking and triangulating landmarks over a sequence of
frames, we construct a sparse voxel map optimized to render image patch
descriptors observed during tracking. Given an initial pose estimate, we first
synthesize descriptors from the voxels using volumetric rendering and then
perform feature matching to estimate the camera pose. This methodology enables
the generation of descriptors for unseen views, enhancing robustness to view
changes. We extensively evaluate our method on the 7-Scenes and Cambridge
Landmarks datasets. Our results show that our method significantly outperforms
existing state-of-the-art feature representation techniques in indoor
environments, achieving up to a 39% improvement in median translation error.
Additionally, our approach yields comparable results to other methods for
outdoor scenarios while maintaining lower memory and computational costs.
comment: Accepted to the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), Tucson, Arizona, US, Feb 28-Mar 4, 2025
♻ ☆ Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
The sim-to-real gap has long posed a significant challenge for robot learning in
simulation, preventing the deployment of learned models in the real world.
Previous work has primarily focused on domain randomization and system
identification to mitigate this gap. However, these methods are often limited
by the inherent constraints of the simulation and graphics engines. In this
work, we propose Vid2Sim, a novel framework that effectively bridges the
sim2real gap through a scalable and cost-efficient real2sim pipeline for neural
3D scene reconstruction and simulation. Given a monocular video as input,
Vid2Sim can generate photorealistic and physically interactable 3D simulation
environments to enable the reinforcement learning of visual navigation agents
in complex urban environments. Extensive experiments demonstrate that Vid2Sim
significantly improves the performance of urban navigation in the digital twins
and real world by 31.2% and 68.3% in success rate compared with agents trained
with prior simulation methods.
comment: Project page: https://metadriverse.github.io/vid2sim/
♻ ☆ Virtual Reflections on a Dynamic 2D Eye Model Improve Spatial Reference Identification
The visible orientation of human eyes creates some transparency about
people's spatial attention and other mental states. This leads to a dual role
for the eyes as a means of sensing and communication. Accordingly, artificial
eye models are being explored as communication media in human-machine
interaction scenarios. One challenge in the use of eye models for communication
consists of resolving spatial reference ambiguities, especially for
screen-based models. Here, we introduce an approach for overcoming this
challenge through the introduction of reflection-like features that are
contingent on artificial eye movements. We conducted a user study with 30
participants in which participants had to use spatial references provided by
dynamic eye models to advance in a fast-paced group interaction task. Compared
to a non-reflective eye model and a pure reflection mode, their combination in
the new approach resulted in a higher identification accuracy and user
experience, suggesting a synergistic benefit.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model
Safe Reinforcement Learning (SRL) aims to realize a safe learning process for
Deep Reinforcement Learning (DRL) algorithms by incorporating safety
constraints. However, the efficacy of SRL approaches often relies on accurate
function approximations, which are notably challenging to achieve in the early
learning stages due to data insufficiency. To address this issue, we introduce
in this work a novel Generalizable Safety enhancer (GenSafe) that is able to
overcome the challenge of data insufficiency and enhance the performance of SRL
approaches. Leveraging model order reduction techniques, we first propose an
innovative method to construct a Reduced Order Markov Decision Process (ROMDP)
as a low-dimensional approximator of the original safety constraints. Then, by
solving the reformulated ROMDP-based constraints, GenSafe refines the actions
of the agent to increase the possibility of constraint satisfaction.
Essentially, GenSafe acts as an additional safety layer for SRL algorithms. We
evaluate GenSafe on multiple SRL approaches and benchmark problems. The results
demonstrate its capability to improve safety performance, especially in the
early learning phases, while maintaining satisfactory task performance. Our
proposed GenSafe not only offers a novel measure to augment existing SRL
methods but also shows broad compatibility with various SRL algorithms, making
it applicable to a wide range of systems and SRL problems.
♻ ☆ Perception Matters: Enhancing Embodied AI with Uncertainty-Aware Semantic Segmentation
Embodied AI has made significant progress acting in unexplored environments.
However, tasks such as object search have largely focused on efficient policy
learning. In this work, we identify several gaps in current search methods:
They largely focus on dated perception models, neglect temporal aggregation,
and transfer from ground truth directly to noisy perception at test time,
without accounting for the resulting overconfidence in the perceived state. We
address the identified problems through calibrated perception probabilities and
uncertainty across aggregation and found decisions, thereby adapting the models
for sequential tasks. The resulting methods can be directly integrated with
pretrained models across a wide family of existing search approaches at no
additional training cost. We perform extensive evaluations of aggregation
methods across both different semantic perception models and policies,
confirming the importance of calibrated uncertainties in both the aggregation
and found decisions. We make the code and trained models available at
https://semantic-search.cs.uni-freiburg.de.
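Temporal aggregation of calibrated per-frame probabilities can be illustrated with a Bayesian log-odds update, the standard occupancy-grid-style rule; the paper's exact aggregation scheme may differ, and the function name and prior are assumptions.

```python
import math

def aggregate_logodds(probs, prior=0.5, eps=1e-6):
    """Fuse a sequence of calibrated semantic probabilities for one map cell.

    Each observation contributes its log-odds relative to the prior, so
    repeated consistent evidence sharpens the estimate while a single
    noisy frame cannot dominate it.
    """
    logit = lambda p: math.log(p / (1.0 - p))
    l = logit(prior)
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid infinite odds
        l += logit(p) - logit(prior)      # fold in one calibrated observation
    return 1.0 / (1.0 + math.exp(-l))     # back to a probability
```

This is exactly where calibration matters: an overconfident model feeds extreme log-odds into the accumulator, and the aggregated state inherits that overconfidence.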
♻ ☆ DIDLM: A SLAM Dataset for Difficult Scenarios Featuring Infrared, Depth Cameras, LIDAR, 4D Radar, and Others under Adverse Weather, Low Light Conditions, and Rough Roads
Adverse weather conditions, low-light environments, and bumpy road surfaces
pose significant challenges to SLAM in robotic navigation and autonomous
driving. Existing datasets in this field predominantly rely on single sensors
or combinations of LiDAR, cameras, and IMUs. However, 4D millimeter-wave radar
demonstrates robustness in adverse weather, infrared cameras excel in capturing
details under low-light conditions, and depth images provide richer spatial
information. Multi-sensor fusion methods also show potential for better
adaptation to bumpy roads. Despite some SLAM studies incorporating these
sensors and conditions, there remains a lack of comprehensive datasets
addressing low-light environments and bumpy road conditions, or featuring a
sufficiently diverse range of sensor data. In this study, we introduce a
multi-sensor dataset covering challenging scenarios such as snowy weather,
rainy weather, nighttime conditions, speed bumps, and rough terrains. The
dataset includes rarely utilized sensors for extreme conditions, such as 4D
millimeter-wave radar, infrared cameras, and depth cameras, alongside 3D LiDAR,
RGB cameras, GPS, and IMU. It supports both autonomous driving and ground robot
applications and provides reliable GPS/INS ground truth data, covering
structured and semi-structured terrains. We evaluated various SLAM algorithms
using this dataset, including RGB images, infrared images, depth images, LiDAR,
and 4D millimeter-wave radar. The dataset spans a total of 18.5 km, 69 minutes,
and approximately 660 GB, offering a valuable resource for advancing SLAM
research under complex and extreme conditions. Our dataset is available at
https://github.com/GongWeiSheng/DIDLM.
♻ ☆ Evaluation of Artificial Intelligence Methods for Lead Time Prediction in Non-Cycled Areas of Automotive Production
The present study examines the effectiveness of applying Artificial
Intelligence methods in an automotive production environment to predict unknown
lead times in a non-cycle-controlled production area. Data structures are
analyzed to identify contextual features and then preprocessed using one-hot
encoding. Method selection focuses on supervised machine learning techniques,
with both regression and classification methods evaluated. Continuous
regression is not feasible given the distribution of the target variable.
Analysis of classification methods shows that Ensemble Learning and
Support Vector Machines are the most suitable. Preliminary study results
indicate that gradient boosting algorithms LightGBM, XGBoost, and CatBoost
yield the best results. After further testing and extensive hyperparameter
optimization, the final method choice is the LightGBM algorithm. Depending on
feature availability and prediction interval granularity, relative prediction
accuracies of up to 90% can be achieved. Further tests highlight the importance
of periodic retraining of AI models to accurately represent complex production
processes using the database. The research demonstrates that AI methods can be
effectively applied to highly variable production data, adding business value
by providing an additional metric for various control tasks while outperforming
current non-AI-based systems.
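The one-hot preprocessing step mentioned above can be sketched with plain Python. This is a hypothetical illustration of the encoding, not the study's pipeline; the feature names and helper signature are assumptions.

```python
def one_hot_encode(rows, feature_values):
    """One-hot encode categorical production features into binary columns.

    rows:           list of dicts mapping feature name -> observed category
    feature_values: dict mapping feature name -> ordered list of known categories
    Unknown categories encode as all zeros for that feature.
    """
    encoded = []
    for row in rows:
        vec = []
        for feat, values in feature_values.items():
            vec.extend(1 if row.get(feat) == v else 0 for v in values)
        encoded.append(vec)
    return encoded
```

In practice a library encoder (e.g. scikit-learn's `OneHotEncoder`) would be used, but the transformation it performs is the one shown.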
♻ ☆ GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface
Issatay Tokmurziyev, Miguel Altamirano Cabrera, Luis Moreno, Muhammad Haris Khan, Dzmitry Tsetserukou
We present GazeGrasp, a gaze-based manipulation system enabling individuals
with motor impairments to control collaborative robots using eye-gaze. The
system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and
YOLOv8 for object localization, integrated with a Universal Robot UR10 for
manipulation tasks. After user-specific calibration, the system allows
intuitive object selection with a magnetic snapping effect and robot control
via eye gestures. Experimental evaluation involving 13 participants
demonstrated that the magnetic snapping effect significantly reduced gaze
alignment time, improving task efficiency by 31%. GazeGrasp provides a robust,
hands-free interface for assistive robotics, enhancing accessibility and
autonomy for users.
comment: Accepted to: IEEE/ACM International Conference on Human-Robot
Interaction (HRI 2025)
♻ ☆ Cooperative Aerial Robot Inspection Challenge: A Benchmark for Heterogeneous Multi-UAV Planning and Lessons Learned
Muqing Cao, Thien-Minh Nguyen, Shenghai Yuan, Andreas Anastasiou, Angelos Zacharia, Savvas Papaioannou, Panayiotis Kolios, Christos G. Panayiotou, Marios M. Polycarpou, Xinhang Xu, Mingjie Zhang, Fei Gao, Boyu Zhou, Ben M. Chen, Lihua Xie
We propose the Cooperative Aerial Robot Inspection Challenge (CARIC), a
simulation-based benchmark for motion planning algorithms in heterogeneous
multi-UAV systems. CARIC features UAV teams with complementary sensors,
realistic constraints, and evaluation metrics prioritizing inspection quality
and efficiency. It offers a ready-to-use perception-control software stack and
diverse scenarios to support the development and evaluation of task allocation
and motion planning algorithms. Competitions using CARIC were held at IEEE CDC
2023 and the IROS 2024 Workshop on Multi-Robot Perception and Navigation,
attracting innovative solutions from research teams worldwide. This paper
examines the top three teams from CDC 2023, analyzing their exploration,
inspection, and task allocation strategies while drawing insights into their
performance across scenarios. The results highlight the task's complexity and
suggest promising directions for future research in cooperative multi-UAV
systems.
comment: Please find our website at https://ntu-aris.github.io/caric
♻ ☆ Analyzing Infrastructure LiDAR Placement with Realistic LiDAR Simulation Library ICRA'23
Recently, Vehicle-to-Everything (V2X) cooperative perception has attracted
increasing attention. Infrastructure sensors play a critical role in this
research field; however, how to find the optimal placement of infrastructure
sensors is rarely studied. In this paper, we investigate the problem of
infrastructure sensor placement and propose a pipeline that can efficiently and
effectively find optimal installation positions for infrastructure sensors in a
realistic simulated environment. To better simulate and evaluate LiDAR
placement, we establish a Realistic LiDAR Simulation library that can simulate
the unique characteristics of different popular LiDARs and produce
high-fidelity LiDAR point clouds in the CARLA simulator. Through simulating
point cloud data in different LiDAR placements, we can evaluate the perception
accuracy of these placements using multiple detection models. Then, we analyze
the correlation between the point cloud distribution and perception accuracy by
calculating the density and uniformity of regions of interest. Experiments show
that when using the same number and type of LiDAR, the placement scheme
optimized by our proposed method improves the average precision by 15%,
compared with the conventional placement scheme in the standard lane scene. We
also analyze the correlation between perception performance in the region of
interest and LiDAR point cloud distribution and validate that density and
uniformity can be indicators of performance. Both the RLS Library and related
code will be released at https://github.com/PJLab-ADG/PCSim.
comment: 7 pages, 6 figures, accepted to the IEEE International Conference on
Robotics and Automation (ICRA'23)
♻ ☆ Cost-Effective Robotic Handwriting System with AI Integration
This paper introduces a cost-effective robotic handwriting system designed to
replicate human-like handwriting with high precision. Combining a Raspberry Pi
Pico microcontroller, 3D-printed components, and a machine learning-based
handwriting generation model implemented via TensorFlow, the system converts
user-supplied text into realistic stroke trajectories. By leveraging
lightweight 3D-printed materials and efficient mechanical designs, the system
achieves a total hardware cost of approximately $56, significantly
undercutting commercial alternatives. Experimental evaluations demonstrate
handwriting precision within ±0.3 millimeters and a writing speed of
approximately 200 mm/min, positioning the system as a viable solution for
educational, research, and assistive applications. This study seeks to lower
the barriers to personalized handwriting technologies, making them accessible
to a broader audience.
comment: This is an updated version of a paper originally presented at the
2024 IEEE Long Island Systems, Applications and Technology Conference (LISAT)
♻ ☆ Tactile-based Exploration, Mapping and Navigation with Collision-Resilient Aerial Vehicles
This article introduces XPLORER, a passive deformable UAV with a
spring-augmented chassis and proprioceptive state awareness, designed to endure
collisions and maintain smooth contact. We develop a fast-converging external
force estimation algorithm for XPLORER that leverages onboard sensors and
proprioceptive data for contact and collision detection. Using this force
information, we propose four motion primitives, including three novel
tactile-based primitives (tactile-traversal, tactile-turning, and ricocheting)
to aid XPLORER in navigating unknown environments. These primitives
are synthesized autonomously in real-time to enable efficient exploration and
navigation by leveraging collisions and contacts. Experimental results
demonstrate the effectiveness of our approach, highlighting the potential of
passive deformable UAVs for contact-rich real-world tasks such as
non-destructive inspection, surveillance and mapping, and pursuit/evasion.